This is an example notebook demonstrating the jupyter_spark notebook plugin.
It is based on the pi-approximation example in the PySpark documentation. It works by sampling random points in a square and counting how many fall inside the unit circle: the circle has area pi while the enclosing square has area 4, so the fraction of points that land inside approaches pi/4, and pi is roughly 4 * (points inside) / (total points).
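To see the idea before bringing in Spark, here is a minimal pure-Python sketch of the same Monte Carlo method (the 100,000 sample count is an arbitrary choice for illustration):

from random import random

samples = 100000  # small single-machine run, just to illustrate the idea
inside = 0
for _ in range(samples):
    x = random() * 2 - 1  # random point in the square (-1, -1) to (1, 1)
    y = random() * 2 - 1
    if x ** 2 + y ** 2 <= 1:  # point falls inside the unit circle
        inside += 1
print("Pi is roughly %f" % (4.0 * inside / samples))

The Spark version below distributes exactly this loop across workers.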
In [1]:
from random import random
from operator import add
from pyspark.sql import SparkSession
Create a SparkSession and give the application a name.
Note: this starts the Spark session for the notebook; there is no need to run spark-shell directly.
In [2]:
spark = SparkSession \
    .builder \
    .appName("PythonPi") \
    .getOrCreate()
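For reference, the builder can also take a master URL and configuration settings when you need more control over where the job runs. A hedged sketch, where the local[*] master and the memory value are illustrative assumptions rather than anything this notebook requires:

spark = (
    SparkSession.builder
    .master("local[*]")  # illustrative: run locally on all available cores
    .appName("PythonPi")
    .config("spark.executor.memory", "1g")  # illustrative setting
    .getOrCreate()
)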
partitions
is the number of partitions to split the sampling work into; Spark schedules these across its workers.
In [3]:
partitions = 2
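Instead of hard-coding the value, you could derive it from Spark's own default parallelism; a sketch (whether this suits your cluster is an assumption):

# use Spark's default level of parallelism
# (typically the total number of cores available)
partitions = spark.sparkContext.defaultParallelism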
n
is the total number of random samples to draw.
In [4]:
n = 100000000
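The estimate converges slowly: the count is binomial with success probability p = pi/4, so the standard error of 4 * count / n is 4 * sqrt(p * (1 - p) / n), meaning each additional decimal digit of accuracy costs roughly 100x more samples. A quick back-of-the-envelope check, pure Python and purely illustrative:

import math

p = math.pi / 4  # probability a random point lands inside the circle
stderr = 4 * math.sqrt(p * (1 - p) / n)  # standard error of the pi estimate
print("expected standard error with n = %d: ~%g" % (n, stderr))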
This is the sampling function. It generates a random point in the square from (-1, -1) to (1, 1) and returns 1 if the point falls inside the unit circle, and 0 otherwise.
In [5]:
def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0
Here's where we farm the work out to Spark: parallelize creates an RDD of n elements split into the requested number of partitions, map applies the sampling function to each element, and reduce(add) sums the results into a count of hits.
In [6]:
count = spark.sparkContext \
    .parallelize(range(1, n + 1), partitions) \
    .map(f) \
    .reduce(add)
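Because f returns only 0 or 1, reduce(add) simply counts the hits; the RDD's built-in sum action is an equivalent way to write the same computation (shown as an alternative, not what the original example uses):

count = spark.sparkContext \
    .parallelize(range(1, n + 1), partitions) \
    .map(f) \
    .sum()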
In [7]:
print("Pi is roughly %f" % (4.0 * count / n))
Shut down the Spark session.
In [8]:
spark.stop()